Predictive Modeling in Enterprise Miner Versus Regression

Author

  • Patricia B. Cerrito
Abstract

We investigate regression models in SAS/STAT and compare them to the predictive models in SAS Enterprise Miner. In large samples, the p-value becomes meaningless because even a virtually zero effect size is statistically significant; therefore, there must be another way to determine the adequacy of the model. In addition, logistic regression cannot be used to predict rare occurrences. Such a model will be highly accurate, generally predicting all occurrences as non-occurrences, but it will have no practical use whatsoever in identifying those at high risk. In contrast, predictive modeling in Enterprise Miner was designed to accommodate large samples and rare occurrences, and it provides many measures of model adequacy.

INTRODUCTION

Predictive modeling includes regression, both logistic and linear, depending upon the type of outcome variable; it can also include the generalized linear model. However, other types of models are available as well under the general term of predictive modeling, including decision trees, artificial neural networks, and nearest neighbor discriminant analysis, also known as memory-based reasoning. These other models are nonparametric and do not require that you know the probability distribution of the underlying patient population; therefore, they are much more flexible when used to examine patient outcomes. Because predictive modeling uses regression in addition to these other models, the end result will improve upon that found using regression by itself. Some, but not all, of the predictive models require that all of the x-variables be independent. However, predictive models must still generally assume uniformity of data entry. Because of the flexibility in the use of variables to define confounding factors, we can account for the presence or absence of uniformity in the model itself. We can define a variable for the severity of the outcome and see how the inputs affect it.
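The failure of accuracy as a criterion for rare occurrences can be illustrated with a short, hypothetical Python sketch (not Enterprise Miner itself): on a dataset where only 1% of cases are occurrences, a degenerate model that always predicts non-occurrence scores 99% accuracy while identifying no one at high risk.

```python
# Hypothetical imbalanced outcome: 10 occurrences among 1000 cases (1%).
outcomes = [1] * 10 + [0] * 990

# A degenerate "model" that predicts non-occurrence for every case.
predictions = [0 for _ in outcomes]

correct = sum(p == y for p, y in zip(predictions, outcomes))
accuracy = correct / len(outcomes)

true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, outcomes))
sensitivity = true_positives / sum(outcomes)

print(f"accuracy    = {accuracy:.2%}")    # 99.00% -- looks excellent
print(f"sensitivity = {sensitivity:.2%}")  # 0.00% -- finds no high-risk cases
```

The misclassification rate alone therefore cannot distinguish this useless model from a genuinely informative one; a measure that weights the rare class is required.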
Since the datasets used in predictive modeling are generally too large for a p-value to have meaning, predictive modeling uses other measures of model fit. Generally, too, there are enough observations that the data can be partitioned into two or more subsets. The first subset is used to define (or train) the model; the second can be used in an iterative process to improve the model; the third, also known as a holdout sample, is used to test the model for accuracy. The definition of the "best" model needs to be considered in this context as well. Just what do we mean by "best"? In a regression model, the "best" model is the one that satisfies the criterion of the uniformly minimum variance unbiased estimator; in other words, it is only "best" within the class of unbiased estimators. As soon as the class of estimators is expanded, "best" no longer exists, and we must define the criteria we will use to determine a "best" fit. There are several criteria to consider. For a binary outcome variable, we can use the misclassification rate. However, especially in medicine, different misclassifications can have different costs. For example, a false positive error is not as costly as a false negative error when the outcome involves the diagnosis of a terminal disease. Another difference when using predictive modeling is that many different models can be defined and compared to find the one that is best. We can use traditional regression, but also decision trees and neural network analysis, and we can combine different models to define a new model. The use of multiple models has generally been frowned upon because it makes it possible to "shop" for one that appears effective; indeed, nearest neighbor discriminant analysis can always find a model that predicts correctly 100% of the time on the data used to define it, but 0% of the time on any subsequent data. When using multiple models, it is therefore essential to define a holdout sample that can be used to test the results.
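The three-way partition and the asymmetric-cost criterion described above can be sketched in plain Python. The 60/20/20 split ratios and the 10:1 false-negative-to-false-positive cost ratio are illustrative assumptions, not figures from the paper:

```python
import random

random.seed(7)

# Hypothetical dataset of (case_id, binary_outcome) pairs.
data = [(i, random.random() < 0.1) for i in range(1000)]

# Partition into training, validation, and holdout subsets (assumed 60/20/20).
random.shuffle(data)
n = len(data)
train    = data[: int(0.6 * n)]               # define (train) the model
validate = data[int(0.6 * n): int(0.8 * n)]   # iteratively improve the model
holdout  = data[int(0.8 * n):]                # test the final model

assert len(train) + len(validate) + len(holdout) == n

# Asymmetric misclassification costs: a false negative (a missed diagnosis)
# is assumed here to cost 10x a false positive.
COST_FP, COST_FN = 1.0, 10.0

def expected_cost(false_pos, false_neg, total):
    """Average misclassification cost per case under asymmetric costs."""
    return (COST_FP * false_pos + COST_FN * false_neg) / total

# A model with more errors overall can still be preferable by cost:
print(expected_cost(false_pos=50, false_neg=5, total=1000))   # 0.10
print(expected_cost(false_pos=10, false_neg=20, total=1000))  # 0.21
```

In the last two lines, the first model makes 55 errors and the second only 30, yet the first has less than half the expected cost, because it trades cheap false positives for expensive false negatives.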
BACKGROUND

Predictive modeling routinely makes use of a holdout sample to test the accuracy of the results. Figure 1 demonstrates predictive modeling. In SAS Enterprise Miner, there are two different regression models, three different neural network models, and two decision tree models. There is also a memory-based reasoning model, otherwise known as nearest neighbor discriminant analysis. These models are discussed in detail in Cerrito (2007). It is not our intent here to provide an introductory text on neural networks; instead, we will demonstrate how they can be used effectively to investigate outcome data.

SIMILAR ARTICLES

Investigating Open Source Project Success: A Data Mining Approach to Model Formulation, Validation and Testing

This paper demonstrates the use of Data Mining (DM) techniques in exploratory research. A robust model for identifying the factors that explain the success of Open Source Software (OSS) projects is created, validated and tested. The predictive modeling techniques of Logistic Regression (LR), Decision Trees (DT) and Neural Networks (NN) are used together in this analysis. Using Text Mining resul...


155-2008: Cool New Features in SAS® Enterprise Miner™ 5.3

SAS released Enterprise Miner 5.3 in late 2007 with a veritable plethora of cool new features for data miners everywhere. Nearly every module of the software has been updated. New interactive data preparation tools make it easier to manipulate data and construct a sample for mining. For data exploration, Enterprise Miner now supports hierarchical market baskets to isolate interesting rules at d...


Combining Decision Trees with Regression in Predictive Modeling with SAS® Enterprise Miner™

The purpose of this paper is to illustrate how the Decision Tree node can be used to optimally bin the inputs for use in a logistic regression. Binning can be viewed as a complex non-linear transformation of the inputs. By utilizing this technique, we may be able to capture complex non-linear relationships between an input variable and the target variable. Univariate trees are created by using ...


Comparative Analysis of Neural Network Models for Premises Valuation Using SAS Enterprise Miner

The experiments aimed to compare machine learning algorithms to create models for the valuation of residential premises were conducted using the SAS Enterprise Miner 5.3. Eight different algorithms were used including artificial neural networks, statistical regression and decision trees. All models were applied to actual data sets derived from the cadastral system and the registry of real estat...


Predicting Workers’ Compensation Insurance Fraud Using SAS Enterprise Miner 5.1 and SAS Text Miner

Insurance fraud costs the property and casualty insurance industry over 25 billion dollars (USD) annually. This paper addresses workers' compensation claim fraud. A data mining approach is adopted, and issues of data preparation are discussed. The focus is on building predictive models to score an open claim for a propensity to be fraudulent. A key component to modeling is the use of textual da...



Published: 2009